D² Tweedie Score (d2_tweedie_score)#
d2_tweedie_score is a relative-to-baseline regression score:
“What fraction of the Tweedie deviance did my model explain compared to just predicting the mean?”
It generalizes \(R^2\) to non-Gaussian targets (counts, positive skewed data, and zero-mass + continuous data) via the Tweedie family.
Learning goals#
By the end you should be able to:
define \(D^2\) as “fraction of Tweedie deviance explained”
understand the power parameter \(p\) and the allowed target/prediction domains
implement mean_tweedie_deviance and d2_tweedie_score from scratch in NumPy (incl. sample weights)
visualize how different \(p\) values change the penalty for under/over-prediction
optimize a simple Poisson regression (a Tweedie GLM with \(p=1\)) and track \(D^2\) during training
Quick import#
from sklearn.metrics import d2_tweedie_score
Prerequisites#
logs and exponentials
weighted averages
basic regression notation (\(y\), \(\hat{y}\))
(optional) GLMs / deviance
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import d2_tweedie_score as sk_d2_tweedie_score
from sklearn.metrics import mean_tweedie_deviance as sk_mean_tweedie_deviance
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition: “fraction of Tweedie deviance explained”#
Given targets \(y_1,\dots,y_n\), predictions \(\hat{\mu}_1,\dots,\hat{\mu}_n\), and optional sample weights \(w_i\), define the (possibly weighted) mean Tweedie deviance:
\[
\bar{d}_p(y, \hat{\mu}) = \frac{\sum_{i=1}^n w_i \, d_p(y_i, \hat{\mu}_i)}{\sum_{i=1}^n w_i},
\]
where \(d_p\) is the per-sample unit deviance defined below.
The \(D^2\) Tweedie score compares your model to the constant baseline that predicts the (weighted) mean \(\bar{y}\):
\[
D^2_p = 1 - \frac{\bar{d}_p(y, \hat{\mu})}{\bar{d}_p(y, \bar{y})}
\]
Interpretation (just like \(R^2\)):
\(D^2_p = 1\): perfect predictions (zero deviance)
\(D^2_p = 0\): no improvement over predicting the mean
\(D^2_p < 0\): worse than the mean baseline
Special case: \(p=0\) uses squared-error deviance, so \(D^2_0\) equals \(R^2\).
# Small example: same y_true, different predictors, different powers
y_true = np.array([0.5, 1.0, 2.5, 7.0])
preds = {
"Mean baseline": np.full_like(y_true, y_true.mean()),
"Example model": np.array([1.0, 1.0, 5.0, 3.5]),
"Perfect": y_true.copy(),
"Bad constant": np.full_like(y_true, 10.0),
}
powers = [0, 1, 2]
fig = go.Figure()
for p in powers:
scores = [sk_d2_tweedie_score(y_true, y_pred, power=p) for y_pred in preds.values()]
fig.add_trace(go.Bar(name=f"power={p}", x=list(preds.keys()), y=scores))
fig.update_layout(
title="D² Tweedie score across different powers (same data, same predictions)",
barmode="group",
xaxis_title="Predictor",
yaxis_title="D² (higher is better)",
width=950,
height=420,
)
fig.show()
2) Tweedie family + Tweedie deviance (the loss behind \(D^2\))#
The Tweedie family is an exponential-dispersion model where the variance scales with the mean:
\[
\operatorname{Var}[Y] = \phi \, \mu^p,
\]
where \(\phi > 0\) is a dispersion parameter and \(p\) is the Tweedie power.
Common cases:
| \(p\) | Distribution | Typical target domain |
|---|---|---|
| \(0\) | Normal | \(y\in\mathbb{R}\) |
| \(1\) | Poisson | \(y\ge 0\) (counts) |
| \(1<p<2\) | Compound Poisson–Gamma | \(y\ge 0\) (mass at 0 + continuous \(>0\)) |
| \(2\) | Gamma | \(y>0\) |
| \(3\) | Inverse Gaussian | \(y>0\) |
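The mean–variance relationship behind these cases can be checked by simulation; a small sketch (the sample sizes and means here are illustrative) for the Poisson (\(p=1\)) and Gamma (\(p=2\)) rows:

```python
import numpy as np

rng_var = np.random.default_rng(0)
means = np.array([1.0, 2.0, 4.0, 8.0])

# Poisson (p=1): Var[Y] ≈ μ, so var/mean ≈ 1 at every scale
pois_var = np.array([rng_var.poisson(m, size=200_000).var() for m in means])
print("Poisson var/mean  :", pois_var / means)

# Gamma (p=2): with shape k fixed, Var[Y] = μ²/k, so var/mean² ≈ 1/k
k = 2.0
gamma_var = np.array([rng_var.gamma(k, m / k, size=200_000).var() for m in means])
print("Gamma   var/mean² :", gamma_var / means**2)
```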
In a GLM, we model a mean \(\mu_i = \mathbb{E}[y_i\mid x_i]\). To keep \(\mu_i>0\) we often use a log link:
\[
\log \mu_i = x_i^\top \beta
\quad\Longleftrightarrow\quad
\mu_i = \exp(x_i^\top \beta).
\]
Unit deviance \(d_p(y,\mu)\)#
mean_tweedie_deviance is the average of a per-sample unit deviance \(d_p(y,\mu)\) (with the convention \(0\log 0 = 0\)).
Special cases:
\[
d_0(y,\mu) = (y-\mu)^2, \qquad
d_1(y,\mu) = 2\left(y\log\frac{y}{\mu} - y + \mu\right), \qquad
d_2(y,\mu) = 2\left(\log\frac{\mu}{y} + \frac{y}{\mu} - 1\right).
\]
General case (\(p \notin \{0,1,2\}\)):
\[
d_p(y,\mu) = 2\left(
\frac{y^{2-p}}{(1-p)(2-p)}
- \frac{y\,\mu^{1-p}}{1-p}
+ \frac{\mu^{2-p}}{2-p}
\right).
\]
Domain requirements (matching scikit-learn):
\(p < 0\): \(y\in\mathbb{R}\), \(\mu>0\) (the formula uses \(\max(y,0)^{2-p}\) to avoid non-real powers)
\(p = 0\): \(y,\mu\in\mathbb{R}\)
\(1 \le p < 2\): \(y\ge 0\), \(\mu>0\)
\(p \ge 2\): \(y>0\), \(\mu>0\)
Derivative (useful for optimization)#
For all \(p\) (via limits at \(p=0,1,2\)):
\[
\frac{\partial}{\partial \mu}\, d_p(y,\mu) = 2\,\mu^{-p}\,(\mu - y).
\]
So \(d_p\) is minimized at \(\mu=y\), but the curvature / asymmetry depends on \(p\).
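The derivative \(\frac{\partial}{\partial \mu} d_p(y,\mu) = 2\,\mu^{-p}(\mu-y)\) can be sanity-checked by central finite differences; a sketch at the non-special power \(p=1.5\) (the point \((y,\mu)=(3,2)\) is arbitrary):

```python
import numpy as np

def unit_dev_general(y, mu, p):
    # General-case Tweedie unit deviance (valid for p not in {0, 1, 2})
    return 2 * (
        y ** (2 - p) / ((1 - p) * (2 - p))
        - y * mu ** (1 - p) / (1 - p)
        + mu ** (2 - p) / (2 - p)
    )

p0, y0, mu0, eps = 1.5, 3.0, 2.0, 1e-6
numeric = (unit_dev_general(y0, mu0 + eps, p0) - unit_dev_general(y0, mu0 - eps, p0)) / (2 * eps)
analytic = 2 * mu0 ** (-p0) * (mu0 - y0)
print(numeric, analytic)  # the two values should agree to several decimal places
```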
def _xlogy(x, y):
"""Compute x * log(y) with the convention 0 * log(y) = 0."""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
bx, by = np.broadcast_arrays(x, y)
out = np.zeros_like(bx, dtype=float)
mask = bx != 0
out[mask] = bx[mask] * np.log(by[mask])
return out
def _validate_power(power: float) -> float:
p = float(power)
if not (p <= 0 or p >= 1):
raise ValueError(f"power must satisfy p <= 0 or p >= 1, got p={p}")
return p
def _check_1d_targets(y_true, y_pred, sample_weight=None):
y_true = np.asarray(y_true, dtype=float).reshape(-1)
y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
if y_true.shape != y_pred.shape:
raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")
if sample_weight is None:
return y_true, y_pred, None
w = np.asarray(sample_weight, dtype=float).reshape(-1)
if w.shape[0] != y_true.shape[0]:
raise ValueError(f"sample_weight must have shape (n_samples,), got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
if float(np.sum(w)) == 0.0:
raise ValueError("Sum of sample_weight must be > 0")
return y_true, y_pred, w
def tweedie_unit_deviance_numpy(y_true, y_pred, *, power: float = 0.0) -> np.ndarray:
"""Per-sample Tweedie unit deviance d_p(y, μ) (NumPy).
This matches scikit-learn's conventions and input domain checks.
"""
y_true = np.asarray(y_true, dtype=float).reshape(-1)
y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
if y_true.shape != y_pred.shape:
raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")
p = _validate_power(power)
message = f"Mean Tweedie deviance error with power={p} can only be used on "
if p < 0:
# 'Extreme stable', y any real number, y_pred > 0
if np.any(y_pred <= 0):
raise ValueError(message + "strictly positive y_pred.")
elif p == 0:
# Normal, y and y_pred can be any real number
pass
elif 1 <= p < 2:
# Poisson and compound Poisson distribution, y >= 0, y_pred > 0
if np.any(y_true < 0) or np.any(y_pred <= 0):
raise ValueError(message + "non-negative y and strictly positive y_pred.")
elif p >= 2:
# Gamma / Inverse Gaussian / positive stable, y and y_pred > 0
if np.any(y_true <= 0) or np.any(y_pred <= 0):
raise ValueError(message + "strictly positive y and y_pred.")
zero = 0.0
if p < 0:
# Prevent non-real powers for negative targets
y_pos = np.where(y_true > 0, y_true, zero)
dev = 2 * (
np.power(y_pos, 2 - p) / ((1 - p) * (2 - p))
- y_true * np.power(y_pred, 1 - p) / (1 - p)
+ np.power(y_pred, 2 - p) / (2 - p)
)
elif p == 0:
dev = (y_true - y_pred) ** 2
elif p == 1:
dev = 2 * (_xlogy(y_true, y_true / y_pred) - y_true + y_pred)
elif p == 2:
dev = 2 * (np.log(y_pred / y_true) + y_true / y_pred - 1)
else:
dev = 2 * (
np.power(y_true, 2 - p) / ((1 - p) * (2 - p))
- y_true * np.power(y_pred, 1 - p) / (1 - p)
+ np.power(y_pred, 2 - p) / (2 - p)
)
return dev
def mean_tweedie_deviance_numpy(y_true, y_pred, *, power: float = 0.0, sample_weight=None) -> float:
"""Mean Tweedie deviance (NumPy)."""
y_true, y_pred, w = _check_1d_targets(y_true, y_pred, sample_weight)
dev = tweedie_unit_deviance_numpy(y_true, y_pred, power=power)
if w is None:
return float(np.mean(dev))
return float(np.average(dev, weights=w))
def d2_tweedie_score_numpy(y_true, y_pred, *, power: float = 0.0, sample_weight=None) -> float:
"""NumPy implementation of scikit-learn's d2_tweedie_score.
Notes
-----
- Best score is 1.0; it can be negative.
- Returns NaN if n_samples < 2.
- Can also return NaN if the baseline deviance is 0 (e.g., constant targets).
"""
y_true, y_pred, w = _check_1d_targets(y_true, y_pred, sample_weight)
if y_true.shape[0] < 2:
return float("nan")
numerator = mean_tweedie_deviance_numpy(y_true, y_pred, power=power, sample_weight=w)
y_avg = float(np.average(y_true, weights=w)) if w is not None else float(np.mean(y_true))
denominator = mean_tweedie_deviance_numpy(
y_true,
np.full_like(y_true, y_avg),
power=power,
sample_weight=w,
)
with np.errstate(divide="ignore", invalid="ignore"):
return float(1.0 - numerator / denominator)
# Quick checks vs scikit-learn (should match exactly)
for p in [0, 1, 1.5, 2, 3, -0.5]:
if p == 0:
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)
elif p < 0:
y_true = rng.normal(size=200)
y_pred = np.exp(rng.normal(size=200))
elif 1 <= p < 2:
y_true = rng.poisson(3.0, size=200).astype(float)
y_pred = np.exp(rng.normal(size=200))
else:
y_true = rng.gamma(shape=2.0, scale=2.0, size=200)
y_pred = np.exp(rng.normal(size=200))
w = rng.uniform(0.2, 2.0, size=200)
sk_loss = sk_mean_tweedie_deviance(y_true, y_pred, power=p, sample_weight=w)
np_loss = mean_tweedie_deviance_numpy(y_true, y_pred, power=p, sample_weight=w)
sk_d2 = sk_d2_tweedie_score(y_true, y_pred, power=p, sample_weight=w)
np_d2 = d2_tweedie_score_numpy(y_true, y_pred, power=p, sample_weight=w)
print(f"power={p:>4}: mean dev diff={abs(sk_loss-np_loss):.3e} | D² diff={abs(sk_d2-np_d2):.3e}")
power= 0: mean dev diff=0.000e+00 | D² diff=0.000e+00
power= 1: mean dev diff=0.000e+00 | D² diff=0.000e+00
power= 1.5: mean dev diff=0.000e+00 | D² diff=0.000e+00
power= 2: mean dev diff=0.000e+00 | D² diff=0.000e+00
power= 3: mean dev diff=0.000e+00 | D² diff=0.000e+00
power=-0.5: mean dev diff=0.000e+00 | D² diff=0.000e+00
3) Intuition: how \(p\) changes the penalty#
Two important intuitions:
Asymmetry for \(p\ge 1\): underpredicting towards \(\mu\to 0^+\) can be extremely costly when \(y>0\).
Mean-dependent scaling: the derivative
\[
\frac{\partial}{\partial \mu}\, d_p(y,\mu) = 2\,\mu^{-p}\,(\mu - y)
\]
shows that (for \(p>0\)) errors at larger \(\mu\) are downweighted by \(\mu^{-p}\). This is the idea behind using Tweedie losses when the variance grows with the mean.
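A related consequence is easy to verify numerically: at \(p=2\) the unit deviance depends on \(y\) and \(\mu\) only through the ratio \(y/\mu\), so a fixed relative error costs the same at every scale (a quick sketch; the 20% over-prediction is illustrative):

```python
from sklearn.metrics import mean_tweedie_deviance

# Same 20% over-prediction at very different scales of y
for scale in [1.0, 10.0, 1000.0]:
    dev = mean_tweedie_deviance([scale], [1.2 * scale], power=2)
    print(f"y={scale:>7}: Gamma deviance={dev:.6f}")  # identical at all scales
```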
Below are deviance curves (lower is better). Notice how the shape changes with \(p\).
# Unit deviance curves for different powers
mu_pos = np.linspace(1e-3, 15, 500)
y_fixed = 5.0
powers_main = [0, 1, 1.5, 2, 3]
mu_zero = np.linspace(1e-6, 10, 500)
y_zero = 0.0
powers_zero = [0, 1, 1.5]
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("y = 5 (compare powers)", "y = 0 (compare how predicting >0 is penalized)"),
)
for p in powers_main:
dev = tweedie_unit_deviance_numpy(np.full_like(mu_pos, y_fixed), mu_pos, power=p)
fig.add_trace(go.Scatter(x=mu_pos, y=dev, mode="lines", name=f"p={p}"), row=1, col=1)
for p in powers_zero:
dev = tweedie_unit_deviance_numpy(np.full_like(mu_zero, y_zero), mu_zero, power=p)
fig.add_trace(go.Scatter(x=mu_zero, y=dev, mode="lines", name=f"p={p}"), row=1, col=2)
fig.add_vline(x=y_fixed, line_dash="dash", line_color="gray", row=1, col=1)
fig.update_xaxes(title_text="Predicted mean μ", row=1, col=1)
fig.update_yaxes(title_text="Unit deviance d_p(y,μ)", row=1, col=1)
fig.update_xaxes(title_text="Predicted mean μ", row=1, col=2)
fig.update_yaxes(title_text="Unit deviance d_p(y,μ)", row=1, col=2)
fig.update_layout(
title="Tweedie unit deviance curves",
width=1100,
height=450,
legend_title_text="power",
)
fig.show()
# D² as a function of multiplicative mis-calibration of the predicted mean
n = 600
x = rng.normal(size=n)
X = np.c_[np.ones(n), x]
beta_true = np.array([0.2, 0.7])
mu_true = np.exp(X @ beta_true)
y = rng.poisson(mu_true)
scales = np.linspace(0.2, 3.0, 90)
powers = [0, 1, 1.5]
fig = go.Figure()
for p in powers:
d2_vals = [d2_tweedie_score_numpy(y, mu_true * s, power=p) for s in scales]
fig.add_trace(go.Scatter(x=scales, y=d2_vals, mode="lines", name=f"power={p}"))
fig.add_hline(y=0, line_dash="dash", line_color="gray")
fig.update_layout(
title="D² vs multiplicative mis-calibration of μ (same y_true, same mu_true)",
xaxis_title="Scale factor s in μ_pred = s · μ_true",
yaxis_title="D² (higher is better)",
width=950,
height=450,
)
fig.show()
4) Using \(D^2\) while optimizing a model (from scratch)#
In
\[
D^2_p = 1 - \frac{\bar{d}_p(y, \hat{\mu})}{\bar{d}_p(y, \bar{y})},
\]
the denominator depends only on \(y\) (and weights), so during training it is a constant.
Maximizing \(D^2_p\) is therefore equivalent to minimizing the mean Tweedie deviance \(\bar{d}_p(y,\hat{\mu})\).
We’ll fit a simple Poisson regression (Tweedie with \(p=1\)) using a log link:
\[
\mu_i = \exp(\eta_i), \qquad \eta_i = x_i^\top \beta.
\]
Using the derivative from above and the chain rule, for general \(p\):
\[
\frac{\partial}{\partial \eta_i}\, d_p(y_i, \mu_i)
= 2\,\mu_i^{-p}(\mu_i - y_i)\cdot \mu_i
= 2\left(\mu_i^{2-p} - y_i\,\mu_i^{1-p}\right).
\]
For Poisson (\(p=1\)): \(\frac{\partial}{\partial \eta_i} d_1 = 2(\mu_i - y_i)\).
# Fit a Poisson GLM with gradient descent and track both deviance and D²
n = 400
x = rng.uniform(-2.0, 2.0, size=n)
X = np.c_[np.ones(n), x]
beta_true = np.array([0.3, 0.8])
mu_true = np.exp(X @ beta_true)
y = rng.poisson(mu_true)
def fit_tweedie_glm_gd(X, y, *, power=1.0, lr=0.05, n_iter=3000, record_every=10):
"""Gradient descent for a Tweedie GLM with log link: μ = exp(Xβ)."""
p = float(power)
n_samples, n_features = X.shape
beta = np.zeros(n_features, dtype=float)
hist = {"iter": [], "loss": [], "d2": []}
for t in range(n_iter + 1):
eta = X @ beta
mu = np.exp(np.clip(eta, -20, 20))
if t % record_every == 0:
hist["iter"].append(t)
hist["loss"].append(mean_tweedie_deviance_numpy(y, mu, power=p))
hist["d2"].append(d2_tweedie_score_numpy(y, mu, power=p))
# ∂/∂η d_p(y, μ) = 2(μ^{2-p} - y μ^{1-p}) for μ = exp(η)
g_eta = 2.0 * (mu ** (2.0 - p) - y * (mu ** (1.0 - p)))
grad = (X.T @ g_eta) / n_samples
beta -= lr * grad
return beta, hist
beta_hat, hist = fit_tweedie_glm_gd(X, y, power=1.0, lr=0.05, n_iter=3000, record_every=10)
beta_hat
fig = make_subplots(rows=1, cols=2, subplot_titles=("Mean Poisson deviance", "D² (power=1)"))
fig.add_trace(go.Scatter(x=hist["iter"], y=hist["loss"], mode="lines", name="loss"), row=1, col=1)
fig.add_trace(go.Scatter(x=hist["iter"], y=hist["d2"], mode="lines", name="D²"), row=1, col=2)
fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=2)
fig.update_xaxes(title_text="Iteration", row=1, col=1)
fig.update_yaxes(title_text="Mean deviance", row=1, col=1)
fig.update_xaxes(title_text="Iteration", row=1, col=2)
fig.update_yaxes(title_text="D²", row=1, col=2)
fig.update_layout(width=1050, height=420, title="Training diagnostics")
fig.show()
# Visualize the fitted mean function μ(x)
mu_hat = np.exp(np.clip(X @ beta_hat, -20, 20))
order = np.argsort(x)
x_s = x[order]
y_s = y[order]
mu_true_s = mu_true[order]
mu_hat_s = mu_hat[order]
mu_baseline = np.full_like(mu_hat_s, y.mean())
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_s, y=y_s, mode="markers", name="y (counts)", opacity=0.35))
fig.add_trace(go.Scatter(x=x_s, y=mu_true_s, mode="lines", name="true μ(x)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=x_s, y=mu_hat_s, mode="lines", name="fitted μ(x)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=x_s, y=mu_baseline, mode="lines", name="mean baseline", line=dict(dash="dash")))
fig.update_layout(
title="Poisson regression fit (log-link): fitted mean vs true mean",
xaxis_title="feature x",
yaxis_title="count / mean",
width=950,
height=450,
)
fig.show()
5) Practical usage in scikit-learn#
The score itself is just:
from sklearn.metrics import d2_tweedie_score
d2 = d2_tweedie_score(y_true, y_pred, power=p)
For modeling, scikit-learn provides GLM-style estimators such as PoissonRegressor, GammaRegressor, and the general TweedieRegressor.
Below is a minimal Poisson example (power=1).
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import train_test_split
X_feat = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_feat, y, test_size=0.3, random_state=0)
model = PoissonRegressor(alpha=0.0, max_iter=1000)
model.fit(X_tr, y_tr)
mu_te = model.predict(X_te)
print("sklearn PoissonRegressor D² (power=1):", sk_d2_tweedie_score(y_te, mu_te, power=1))
# Compare to our gradient descent fit on the same split
X_tr_i = np.c_[np.ones(X_tr.shape[0]), X_tr[:, 0]]
X_te_i = np.c_[np.ones(X_te.shape[0]), X_te[:, 0]]
beta_hat_tr, _ = fit_tweedie_glm_gd(X_tr_i, y_tr, power=1.0, lr=0.05, n_iter=3000, record_every=50)
mu_te_gd = np.exp(np.clip(X_te_i @ beta_hat_tr, -20, 20))
print("NumPy GD D² (power=1):", d2_tweedie_score_numpy(y_te, mu_te_gd, power=1))
sklearn PoissonRegressor D² (power=1): 0.5782287829762082
NumPy GD D² (power=1): 0.5782287833631345
6) Pros, cons, and when to use#
Pros
\(D^2\) is interpretable like \(R^2\): 1 is best, 0 is “mean baseline”, negative means worse than baseline.
It is distribution-aware via deviance, making it a natural score for GLMs.
Works well for heteroscedastic targets where variance grows with the mean (counts, positive skew, etc.).
Supports sample weights.
Cons / pitfalls
You must choose a power \(p\); the value (and model ranking) can change with \(p\).
Domain restrictions: for most \(p\ge 1\) you need \(y\ge 0\) and strictly positive predictions \(\hat{\mu}>0\).
Not defined for \(n<2\) samples; can be NaN when the baseline deviance is 0 (e.g., constant targets).
It is a relative score, not an absolute error, and is not symmetric in its arguments.
Good use cases
\(p=1\) (Poisson): counts / event rates.
\(1<p<2\) (compound Poisson–Gamma): data with many zeros plus positive continuous outcomes (common in insurance claims).
\(p=2\) (Gamma) or \(p=3\) (Inverse Gaussian): strictly positive, right-skewed continuous targets.
7) Exercises#
For a fixed \(y>0\), plot \(d_p(y,\mu)\) for several powers and compare the under- vs over-prediction penalty.
Construct a dataset with constant targets and see what happens to \(D^2\) (baseline deviance \(=0\)).
Add sample weights and verify that the baseline prediction becomes the weighted mean.
Simulate data from a Poisson model and compare \(D^2\) for powers \(p=0\) vs \(p=1\).
References#
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.d2_tweedie_score.html
scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_tweedie_deviance.html
Hastie, Tibshirani, Wainwright (2015), Statistical Learning with Sparsity, Eq. (3.11)